Abstract. Graph construction is a crucial step in spectral clustering
(SC) and graph-based semi-supervised learning (SSL). Spectral methods
applied on standard graphs such as full-RBF, $\epsilon$-graphs and $k$-NN
graphs can lead to poor performance in the presence of proximal and
unbalanced data. This is because spectral methods based on minimizing
RatioCut or normalized cut on these graphs tend to put more importance
on balancing cluster sizes than on reducing cut values. We propose
a novel graph construction technique and show that the RatioCut solution
on this new graph is able to handle proximal and unbalanced data.
Our method is based on adaptively modulating the neighborhood degrees
in a $k$-NN graph, which tends to sparsify neighborhoods in low-density
regions. Our method adapts to data with varying levels of unbalancedness
and can be naturally used for small cluster detection. We justify
our ideas through limit cut analysis. Unsupervised and semi-supervised
experiments on synthetic and real data sets demonstrate the superiority
of our method.

Keywords: Adaptive graph sparsification, small cluster detection



1 Introduction and Motivation 

Graph-based approaches are popular tools for unsupervised clustering and
semi-supervised learning (SSL). In these approaches, a graph representing the
data set is first constructed. Then a graph-based learning algorithm such as
spectral clustering (SC) [3] or SSL algorithms [5,6] is applied on the graph.
Of the two steps, graph construction has been identified to be critical
[5,7,2,8,9]. Effective graph construction strategies turn out to be even more
critical in the presence of unbalanced and proximal data. Unbalanced data
arises routinely in many applications including multi-mode (class) clustering
and SSL tasks. The focus of this paper is on graph construction for spectral
methods and we refer to [10] for model-based approaches.

Common graph construction methods include the $\epsilon$-graph, the
fully-connected RBF-weighted (full-RBF) graph and the $k$-nearest neighbor
($k$-NN) graph. The $\epsilon$-graph links two nodes $u$ and $v$ if
$d(u, v) < \epsilon$. The full-RBF graph links every pair of nodes with RBF
weights $w(u, v) = \exp(-d(u, v)^2 / 2\sigma^2)$, which is in fact a soft
threshold ($\sigma$ plays a role similar to $\epsilon$). The $k$-NN graph links
$u$ and $v$ if $v$ is among the $k$ closest neighbors of $u$ or vice versa. It
is the most recommended method [7,5] due to its relative robustness to
outliers. In [5] the authors propose the $b$-matching graph. This method is
supposed to eliminate some of the spurious edges of the $k$-NN graph and
lead to better performance [5].
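
The three conventional constructions described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, Euclidean distance is assumed for $d(u, v)$, and the $k$-NN graph is symmetrized with the "or vice versa" rule stated in the text.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance matrix between the rows of X (assumed metric)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def epsilon_graph(D, eps):
    """Unweighted eps-graph: link u and v when d(u, v) < eps."""
    A = (D < eps).astype(float)
    np.fill_diagonal(A, 0.0)  # no self-loops
    return A

def full_rbf_graph(D, sigma):
    """Fully connected graph with weights exp(-d^2 / (2 sigma^2))."""
    W = np.exp(-D ** 2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W

def knn_graph(D, k):
    """Unweighted k-NN graph: link u and v if either one is among the
    k closest neighbors of the other."""
    n = D.shape[0]
    D_off = D.copy()
    np.fill_diagonal(D_off, np.inf)  # a point is never its own neighbor
    nbrs = np.argsort(D_off, axis=1)[:, :k]
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), nbrs.ravel()] = 1.0
    return np.maximum(A, A.T)  # "or vice versa" symmetrization
```

Every node in the $k$-NN graph receives at least $k$ edges regardless of the local density, which is the property the RMD construction later relaxes.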

However, for unbalanced and proximal data clusters, SC and graph-based 
SSL algorithms appear to perform poorly on these conventional graphs. This 
poor performance is a result of minimizing RatioCut objective on these graphs. 
For unbalanced and proximal data clusters the RatioCut objective on these 
graphs tends to put more importance on balancing cluster sizes over reducing 
cut values. This sometimes leads to cuts that are not meaningful. In Section 2 we 
will investigate the fundamental reasons that lead to poor results. We will then 
outline a novel graph construction strategy, whereby the RatioCut objective on 
this new graph is able to handle varying levels of proximal and unbalanced data. 
Our rank-modulated degree (RMD) graph construction method, described in 
detail in Section 3, is based on modulating the degrees in a $k$-NN graph. The
impact of this strategy is that it results asymptotically in more edges per node 
in high-density regions and a sparsification near density valleys. We explore the 
theoretical basis for these results in Section 4. In Section 5 we present several 
experiments on synthetic and real datasets and show significant improvements 
in SC and SSL results over conventional graph constructions. 



2 Proximal & Unbalanced Data Clusters 

In this section we will investigate some of the reasons that lead to poor SC 
and SSL performance for conventional graph constructions in the presence of 
proximal and unbalanced data. We draw upon existing results to justify our 
reasoning. 

Let $G = (V, E)$ be the graph constructed from $n$ samples drawn IID from
some underlying density $f(x)$, where $x \in \mathbb{R}^d$. Let $(C, \bar{C})$
be a 2-partition of the nodes separated by a hypersurface $S$. The simple cut
is defined as:

$$\mathrm{Cut}(C, \bar{C}) = \sum_{u \in C,\; v \in \bar{C},\; (u,v) \in E} w(u, v), \qquad (1)$$

where $w(u, v)$ is the weight of edge $(u, v) \in E$. Spectral clustering
techniques are based on minimizing RatioCut:

$$\mathrm{RatioCut}(C, \bar{C}) = \mathrm{Cut}(C, \bar{C}) \left( \frac{1}{|C|} + \frac{1}{|\bar{C}|} \right), \qquad (2)$$

where $|C|$ denotes the number of nodes in $C$. A variant of RatioCut is the
so-called normalized cut (NCut). Our discussion of RatioCut also extends to
NCut, and we will not discuss NCut from here on. Note that RatioCut augments
the simple Cut with a balancing term, which makes partitions less sensitive to
outliers.
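
As a small illustration of Eqs. (1) and (2), the sketch below evaluates Cut and RatioCut for a given weight matrix and a boolean membership vector. The function names and the boolean-mask representation of the partition are assumptions made for this example.

```python
import numpy as np

def cut_value(W, in_C):
    """Simple cut, Eq. (1): total weight of edges between C and its complement."""
    in_C = np.asarray(in_C, dtype=bool)
    return W[np.ix_(in_C, ~in_C)].sum()

def ratio_cut(W, in_C):
    """RatioCut, Eq. (2): cut value times the balancing term 1/|C| + 1/|C-bar|."""
    in_C = np.asarray(in_C, dtype=bool)
    size_C = in_C.sum()
    size_rest = (~in_C).sum()
    return cut_value(W, in_C) * (1.0 / size_C + 1.0 / size_rest)
```

On a 4-node path graph with unit weights, cutting in the middle and cutting off an end node both have Cut = 1, but RatioCut is 1 for the balanced split and 4/3 for the unbalanced one, which shows how the balancing term favors equal-sized parts.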

Fig. 1. Various graphs and SC results. Cut and RatioCut values of (c), (e) are
averaged over 20 Monte Carlo runs. The values are re-scaled here for
demonstration. $d_k$ is the average $k$-NN distance; $n = 1000$, $k = 30$. For
(b), an unweighted RMD graph with $l = 30$, $\lambda = 0.4$ is used; for (d),
an unweighted $k$-NN graph; for (f), $\epsilon = \sigma = d_k$.

Unbalanced Proximal Gaussian Mixture: By means of an example, we
will argue that minimizing RatioCut on conventional graphs has fundamental
drawbacks for clustering proximal and unbalanced data sets. For our
illustrative experiment we consider $n = 1000$ data samples drawn IID from a
proximal and unbalanced 2-D Gaussian mixture density,

$$f(x) = \sum_{i=1}^{2} \alpha_i N(\mu_i, \Sigma_i), \qquad (3)$$

where $\alpha_1 = 0.9$, $\alpha_2 = 0.1$, $\mu_1 = [4.5; 0]$, $\mu_2 = [0; 0]$,
$\Sigma_1 = \mathrm{diag}(2, 1)$, $\Sigma_2 = I$, as shown in Fig. 1. We
examine different graph constructions including full-RBF, (RBF) $k$-NN and
$\epsilon$-graph. Note that these graph constructions are parameterized by $k$,
$\sigma$, $\epsilon$. Our SC results here are depicted for reasonable choices
of these parameters.

A balanced cut in this case is approximately a line parallel to the $x_2$ axis
passing through $x_1 = 4$. A cut at the density valley is approximately a line
parallel to the $x_2$ axis passing through $x_1 = 1$. For SC to seek a cut at
the valley we would need the RatioCut to achieve its minimum at $x_1 = 1$.

The re-scaled simple Cut curve in Fig. 1(c),(e) shows that the Cut value is
relatively large at $x_1 = 1$ due to the fact that the density valley is
"shallow." Fig. 1(c) shows RatioCut values for RBF $k$-NN for large and small
$\sigma$ values. Large $\sigma$ (unweighted $k$-NN behaves similarly) achieves
its minimum at the balanced position ($x_1 \approx 4$), while small $\sigma$
pulls down RatioCut near the boundaries and turns out to be vulnerable to
outliers. Fig. 1(e) shows that full-RBF ($\epsilon$-graph behaves similarly)
with large $\sigma$ tends to smooth out the curve and is insensitive to the
location of the valley, while small $\sigma$ appears to be vulnerable to
outliers. In contrast our method, RMD, appears to be able to reject outliers
and achieves minimum RatioCut close to the valley position.



Graph Partitioning, Cut-values, and Cluster Sizes: By varying $\alpha_i$ in
Eq. (3) we can vary the size of unbalanced clusters; varying $\mu_i, \Sigma_i$
has the effect of varying the proximity of the clusters. For a given value of
$\alpha_i, \mu_i, \Sigma_i$, we let $S_U$ be the locus of points corresponding
to the density valley (for example in Fig. 1 this is the line $x_1 = 1$), and
$S_B$ any line that asymptotically results in two balanced partitions (for
example in Fig. 1 this is the line $x_1 = 4$). Now for a graph $G = (V, E)$,
the lines $S_U$ and $S_B$ describe two different partitions, one unbalanced
but respecting the inherent clustering of the data and the other balanced but
not respecting the underlying data clusters. We denote by $C_U, \bar{C}_U$ the
partitions resulting from a cut associated with the line $S_U$ and by
$C_B, \bar{C}_B$ the partitions resulting from a cut associated with the line
$S_B$.¹ The Cut-ratio $q$ is defined as the ratio of the Cut values
corresponding to the two partitions; $y$ denotes the size of the unbalanced
partition, namely,

$$q = \frac{\mathrm{Cut}(C_U, \bar{C}_U)}{\mathrm{Cut}(C_B, \bar{C}_B)}, \qquad y = \frac{\min\{|C_U|, |\bar{C}_U|\}}{n}. \qquad (4)$$

Now we examine the condition under which the natural unbalanced partition has
a smaller RatioCut value than the balanced partition. This requires that

$$\mathrm{Cut}(C_U, \bar{C}_U) \left( \frac{1}{yn} + \frac{1}{(1-y)n} \right) < \mathrm{Cut}(C_B, \bar{C}_B) \left( \frac{1}{n/2} + \frac{1}{n/2} \right) \implies q < 4y(1-y), \qquad (5)$$

where we have substituted for $q$ from Eq. (4). A plot of the Cut-ratio $q$ for
different unbalanced proportions $y$ is shown in Fig. 2.
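
The step from the RatioCut inequality to the bound on $q$ can be spelled out as follows. This is a worked restatement using only the definitions of $q$ and $y$ given above; the numerical values at the end are simple evaluations of $4y(1-y)$.

```latex
\begin{align*}
\mathrm{Cut}(C_U,\bar{C}_U)\left(\frac{1}{yn}+\frac{1}{(1-y)n}\right)
  &= \frac{\mathrm{Cut}(C_U,\bar{C}_U)}{n\,y(1-y)}, \\
\mathrm{Cut}(C_B,\bar{C}_B)\left(\frac{1}{n/2}+\frac{1}{n/2}\right)
  &= \frac{4\,\mathrm{Cut}(C_B,\bar{C}_B)}{n}, \\
\frac{\mathrm{Cut}(C_U,\bar{C}_U)}{n\,y(1-y)} < \frac{4\,\mathrm{Cut}(C_B,\bar{C}_B)}{n}
  &\iff q = \frac{\mathrm{Cut}(C_U,\bar{C}_U)}{\mathrm{Cut}(C_B,\bar{C}_B)} < 4y(1-y), \\
y = 0.1:\ 4y(1-y) = 0.36, \qquad
y = 0.25:\ 4y(1-y) &= 0.75, \qquad
y = 0.5:\ 4y(1-y) = 1 .
\end{align*}
```

The smaller the minority cluster, the smaller the Cut-ratio must be before RatioCut prefers the valley cut.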



¹ Data samples situated exactly on the line $S_U$ or $S_B$ are randomly assigned.

Consequently, Fig. 2 and Eq. (5) point to a fundamental aspect of RatioCut for
data sets with unbalanced and proximal clusters. If the tuple $(q, y)$ lies
above the curve, the RatioCut value is smaller for balanced partitions than
for partitioning at the density valley (note that for $y \approx 0.1$ the
required Cut-ratio can be as small as 0.36).



Why do conventional graphs fail? 

This is best explained by understanding the limit-cut analysis results for
$k$-NN, $\epsilon$-graph and full-RBF graphs [2,11]. For appropriately chosen
parameters $k$, $\epsilon$ and $\sigma$ respectively, as the number of samples
$n \to \infty$, the Cut-ratio $q$ and the unbalanced cluster size $y$ converge
(with high probability) to:

$$q \to \frac{\int_{S_U} f^{\gamma}(x)\,dx}{\int_{S_B} f^{\gamma}(x)\,dx}, \qquad y \to \min\{\mu(C_U), \mu(\bar{C}_U)\}, \qquad (6)$$

where $\mu(C_U)$, $\mu(\bar{C}_U)$ are the volumes (probabilities) of the sets
$C_U$ and $\bar{C}_U$ under the density $f(x)$ respectively. $\gamma$ is a
constant that depends on the specific graph construction. While standard graph
construction methods do account for the underlying density $f(x)$, this by
itself is insufficient for proximal and unbalanced clusters. For the Gaussian
mixture case (Eq. (3)) it follows from Eq. (6) that $q$ can be relatively
large for an appropriate choice of $\mu_i$, $\Sigma_i$ and a fixed choice of
unbalancedness $y$. Note that $y$ is predominantly controlled through the
mixture proportions $\alpha_i$. Eq. (5) and Fig. 2 assert that in this case
RatioCut has a smaller value for balanced partitions even when the density
valley cut, $S_U$, is the natural choice.

Parameter tuning: It is possible that the parameters $k$, $\sigma$ and
$\epsilon$ can be tuned to account for unbalancedness. However, large values
of $k$, $\sigma$ and $\epsilon$ tend to smooth the underlying distribution
(see Fig. 1) and increase the Cut-ratio, which worsens the problem. In
contrast, decreasing $k$, $\sigma$ and $\epsilon$ below well-understood
acceptable thresholds (see [2,11]) leads to disconnected graphs and
sensitivity to outliers (this is also seen in Fig. 1). While changing the
parameters $k$, $\sigma$, $\epsilon$ can globally modify the graph topology,
this gives poor control over the Cut-ratio. For instance, increasing or
decreasing $k$ results in a $k$-NN graph with a uniformly larger or smaller
number of neighbors for all the nodes and uniformly larger or smaller Cut
values for any cut, leading to poor control of the Cut-ratio.

Controlling Cut Ratio through Graph Sparsification: From the above
discussion it is clear that we need to directly control the Cut-ratio. We do
so by adaptively sparsifying graph neighborhoods. Neighborhoods for nodes in
plausible low-density regions are sparsified and those in high-density regions
are "densified". By controlling this sparsification/densification the
Cut-ratio is controlled and adapted to varying degrees of unbalancedness and
proximity. Comparisons between standard constructions and our RMD graph for
the Gaussian mixture of Eq. (3) are shown in Fig. 1. As seen, our method
sparsifies low-density regions, in contrast to other methods.

Fig. 2. Cut-ratio ($q$) vs. unbalanced cluster size ($y$, the smaller cluster
proportion, shown from 0.1 to 0.5 on the horizontal axis). The RatioCut value
is smaller for balanced cuts than for natural unbalanced cuts whenever the
Cut-ratio is above the curve.

3 RMD Graphs: Main Steps 

Given data samples $\{x_1, \ldots, x_n\}$ in $\mathbb{R}^d$, our rank-modulated
degree (RMD) graph-based learning involves the following steps:

(1) Rank Computation: The rank $R(x)$ of every point $x$ is calculated:

$$R(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{G(x) < G(x_i)\}, \qquad (7)$$

where $\mathbb{I}$ denotes the indicator function. Ideally we would like to
choose $G(\cdot)$ based on the underlying density $f(\cdot)$ of the data.
Since $f$ is unknown, we need to employ some surrogate statistic. While many
choices are possible, the statistic in this paper is based on $k$-nearest
neighbor distances. With such a distance-based $G$, points in dense regions
have small $G$ and therefore high rank. Such rank-based statistics have been
employed for high-dimensional anomaly detection [18,19]. More choices for $G$
and a robust procedure for computing $R(x)$ are described in Sec. 3.1. The
rank is a normalized ordering of all points based on $G$, ranges in $[0, 1]$,
and indicates how extreme the sample point $x$ is among all the points.
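
A minimal sketch of the rank computation in Eq. (7), assuming the statistic $G$ has already been evaluated at every sample. The function name is hypothetical, and $G$ is assumed to be distance-like (smaller values in denser regions).

```python
import numpy as np

def ranks_from_statistic(G):
    """Eq. (7): R(x_j) = (1/n) * #{i : G(x_j) < G(x_i)} for every sample j.

    G is assumed to be a 1-D array of a distance-like statistic, so small
    values of G (dense regions) yield ranks close to 1.
    """
    G = np.asarray(G, dtype=float)
    # Entry (j, i) of the comparison matrix is True when G(x_j) < G(x_i).
    return (G[:, None] < G[None, :]).mean(axis=1)
```

For example, with `G = [0.5, 2.0, 1.0, 3.0]` the point with the smallest statistic gets rank 0.75 and the point with the largest gets rank 0.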

(2) RMD Graph Construction: Connect each point $x$ to its $\deg(x)$ closest
neighbors. The number of neighbors $\deg(x)$ for point $x$ is modulated as
follows:

$$\deg(x) = k\,(\lambda + 2(1 - \lambda)R(x)), \qquad (8)$$

where $\lambda$ is a scalar parameter that will be optimized later. Here $k$
is the average degree and $\lambda \in [0, 1]$ controls the minimum degree. It
is not difficult to see that $R(x)$ converges (in distribution) to the uniform
distribution on the unit interval regardless of the underlying density
$f(\cdot)$ if $G(\cdot)$ is bijective. This implies that the expected value of
$R(x)$ converges to 0.5. Consequently, the average degree across all samples
is $k(\lambda + 2(1 - \lambda) \cdot 0.5) = k$. Furthermore, the above
modulation scheme can be thought of as modulating the degree of each node
around a nominal value equal to $k$. The remaining issue is to optimize over
the scalar parameter $\lambda$, which is described in Step (4).
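
The degree modulation of Eq. (8) can be sketched as below. Two details are assumptions of this example and are not specified in the text: degrees are rounded to the nearest integer with a floor of one neighbor, and the directed neighbor lists are symmetrized by taking the union of edges. The function names are hypothetical.

```python
import numpy as np

def rmd_degrees(R, k, lam):
    """Eq. (8): deg(x) = k * (lam + 2 * (1 - lam) * R(x))."""
    deg = k * (lam + 2.0 * (1.0 - lam) * np.asarray(R, dtype=float))
    # Assumption: round to an integer and keep at least one neighbor.
    return np.maximum(1, np.rint(deg)).astype(int)

def rmd_graph(D, R, k, lam):
    """Unweighted RMD graph from a distance matrix D and ranks R."""
    n = D.shape[0]
    deg = np.minimum(rmd_degrees(R, k, lam), n - 1)
    D_off = D.copy()
    np.fill_diagonal(D_off, np.inf)  # exclude self-neighbors
    order = np.argsort(D_off, axis=1)
    A = np.zeros((n, n))
    for i in range(n):
        A[i, order[i, :deg[i]]] = 1.0  # connect x_i to its deg(x_i) closest points
    return np.maximum(A, A.T)  # assumption: union symmetrization
```

With $\lambda = 1$ every node gets degree $k$, so the construction reduces to an ordinary $k$-NN graph; smaller $\lambda$ widens the gap between low-rank and high-rank nodes.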

(3) Graph-based Learning: The third step involves using the RMD graph in a
graph-based clustering or SSL algorithm. Spectral clustering algorithms based
on RatioCut for 2-class and multi-class clustering are now well established.
For SSL algorithms we employ Gaussian Random Fields (GRF) and Graph
Transduction via Alternating Minimization (GTAM). These approaches all involve
minimizing $\mathrm{Tr}(F^T L F)$ plus some constraints or penalties, where
$F$ is the cluster indicator function or classification (labeling) function
and $L$ is the graph Laplacian matrix. This has been shown to be equivalent to
minimizing RatioCut (NCut) for unnormalized (normalized) $L$ [7]. We refer
readers to references [7,5,6] for details.
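
For the 2-class case, the relaxed RatioCut problem is commonly solved with the second-smallest eigenvector of the unnormalized Laplacian. The sketch below illustrates that standard relaxation; thresholding the eigenvector at zero is an assumption of this example (running k-means on the eigenvector is a common alternative), and the function name is hypothetical.

```python
import numpy as np

def spectral_bipartition(W):
    """Two-way spectral partition from the unnormalized Laplacian L = D - W.

    Returns a boolean mask that marks one side of the partition. The graph
    is assumed to be connected and W symmetric.
    """
    L = np.diag(W.sum(axis=1)) - W
    eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]               # eigenvector of the 2nd smallest eigenvalue
    return fiedler >= 0                   # assumption: split by sign
```

Because the sign of an eigenvector is arbitrary, only the grouping is meaningful, not which side is labeled `True`.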

(4) Optimization over $\lambda$: Our final step is to optimize over
$\lambda \in [0, 1]$. Our main assumption is that we have prior knowledge that
the smallest cluster is at least of size $\delta n$. We consider the 2-cluster
case first. The 2-partitions resulting from spectral clustering algorithms are
now parameterized by $\lambda$: $(C(\lambda), \bar{C}(\lambda))$. We now
minimize the Cut value over all admissible $\lambda$ such that the smallest
cluster is no smaller than the threshold $\delta n$:

$$J(\delta) = \min_{\lambda \in [0,1]} \mathrm{Cut}(C(\lambda), \bar{C}(\lambda)) \quad \text{s.t.} \quad \min\{|C(\lambda)|, |\bar{C}(\lambda)|\} \ge \delta n. \qquad (9)$$

$\delta$ sets the threshold of minimum cluster size: clusters of size smaller
than $\delta n$ are viewed as outliers and will be discarded. Algorithms for
$K$-partition clustering and SSL can be extended in a similar manner by
optimizing suitable objective functions in place of the 2-partition cut value.
Note that a similar optimization step can also be applied to select the best
$k$ and $\sigma$ with traditional graph constructions. We will employ this
strategy for the purpose of comparison on real data sets in Sec. 5.2.
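
The selection step of Eq. (9) amounts to a constrained search over a discretized set of $\lambda$ values. The sketch below assumes a user-supplied callable `partition_fn(lam)` (hypothetical) that builds the graph for a given $\lambda$, runs the clustering algorithm, and returns the membership list together with its Cut value.

```python
def select_lambda(candidates, partition_fn, n, delta):
    """Pick the lambda with the smallest Cut among admissible partitions.

    partition_fn(lam) is assumed to return (in_C, cut), where in_C is a
    sequence of booleans of length n and cut is the Cut value of that
    2-partition. Returns (lam, cut, in_C), or None if no candidate
    satisfies the size constraint.
    """
    best = None
    for lam in candidates:
        in_C, cut = partition_fn(lam)
        size_C = sum(bool(b) for b in in_C)
        if min(size_C, n - size_C) < delta * n:
            continue  # smallest cluster below delta * n: treated as an outlier cut
        if best is None or cut < best[1]:
            best = (lam, cut, in_C)
    return best
```

The same loop could be used over $(k, \sigma)$ pairs for conventional graphs by changing what `partition_fn` takes as its argument.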

3.1 Rank Computation 

The missing component in our RMD method is the specification of the statistic
$G$. We choose the statistic $G$ in Eq. (7) based on nearest-neighbor
distances. Specifically,

$$G(x) = \frac{1}{l} \sum_{i = l - \lfloor (l-1)/2 \rfloor}^{l + \lfloor l/2 \rfloor} D_{(i)}(x), \qquad (10)$$

where $D_{(i)}(x)$ denotes the distance from $x$ to its $i$-th nearest
neighbor, and $G$ is the average of $x$'s $\frac{l}{2}$-th to
$\frac{3l}{2}$-th nearest neighbor distances. Other choices for $G$ are listed
below.

(1) $\epsilon$-Neighborhood: $G(x)$ is the number of neighbors within an
$\epsilon$-ball of $x$.

(2) $l$-Nearest Neighborhood: $G(x)$ is the distance from $x$ to its $l$-th
nearest neighbor.

Empirically (and theoretically) we have observed that the average nearest
neighbor distance leads to better performance and robustness. To reduce
variance during rank computation we adopt a U-statistic resampling technique
[14] with $B$ resamplings.

U-statistic Resampling For Rank Computation: 

Given $N = 2m$ data points,

(a) Randomly split the data into two equal parts: $S_1 = \{x_1, \ldots, x_m\}$,
$S_2 = \{x_{m+1}, \ldots, x_{2m}\}$.

(b) Points in $S_2$ are used to calculate $G$ for $x_i \in S_1$ according to
Eq. (10), and vice versa.

(c) Ranks of $x_i \in S_1$ are computed by Eq. (7) within $S_1$, and similarly
for $x_i \in S_2$.

(d) Resplit the data and repeat the above steps $B$ times. Let $R_b(x_i)$ be
the rank of $x_i$ obtained from the $b$-th resampling. We then use the average
as the final rank:

$$R(x_i) = \frac{1}{B} \sum_{b=1}^{B} R_b(x_i), \qquad i = 1, 2, \ldots, N. \qquad (11)$$
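
Steps (a) to (d) can be sketched as follows. For brevity this example takes $G$ to be the distance to the $l$-th nearest neighbor in the opposite half (one of the alternative statistics listed above) rather than the averaged statistic; the function names, the `seed` argument and the use of Euclidean distance are assumptions of the sketch.

```python
import numpy as np

def knn_distance_stat(X_query, X_ref, l):
    """G(x): distance from each query point to its l-th nearest point in X_ref."""
    diff = X_query[:, None, :] - X_ref[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    return np.sort(D, axis=1)[:, l - 1]

def resampled_ranks(X, l, B, seed=0):
    """Average rank over B random equal splits (assumes an even sample count)."""
    n = X.shape[0]
    m = n // 2
    rng = np.random.default_rng(seed)
    R = np.zeros(n)
    for _ in range(B):
        perm = rng.permutation(n)
        halves = (perm[:m], perm[m:2 * m])                 # step (a)
        for own, other in (halves, halves[::-1]):
            G = knn_distance_stat(X[own], X[other], l)     # step (b)
            # step (c): rank within the own half, as in Eq. (7)
            R[own] += (G[:, None] < G[None, :]).mean(axis=1)
    return R / B                                           # step (d), Eq. (11)
```

Because each point's statistic is computed from the half it does not belong to, the statistic and the ranking reference set are kept separate within every split.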

Properties of the Ranked Data: 

(1) High/Low Density Indicator: The value of $R(x)$ is a direct indicator of
whether $x$ lies in a high- or low-density region (Fig. 3).

(2) Smoothness: $R(x)$ is asymptotically the integral of the pdf (see Thm. 1
in Sec. 4). It is smooth and uniformly distributed in $[0, 1]$. This makes it
appropriate for modulating the degrees with control of the minimum, maximum
and average degree.

(3) Precision: We do not need our estimates to be precise for every point; the
resulting cuts typically depend on which points have relatively low ranks
rather than on the exact rank values of most nearby points.

3.2 Salient Properties of RMD Graphs

Our scheme successfully addresses the following issues:

(1) Captures density valley: The monotonicity of $\deg(x)$ in $R(x)$
immediately implies that nodes in low/high density areas will have fewer/more
edges, thus reducing the Cut-ratio $q$ in Fig. 2 and ensuring that the
RatioCut has low values at density valleys.

(2) Robustifies against Outliers: The minimum degree of nodes in the RMD graph
is $k\lambda$, even for distant outliers. Furthermore, $\lambda$ is the
solution to the optimization step (see Eq. (9)), and so is robust to outliers,
as shown in Fig. 1(c), where the RatioCut curve of the RMD graph (black) goes
up near the boundaries, guaranteeing that the valley minimum is the global
minimum.

(3) Adapts to Unbalanced Clusters: The optimization problem of Eq. (9)
leads to sizable clusters that can be unbalanced. The reason is that small
values of $\lambda$ emphasize the Cut value over the balancing term. This has
the effect of preferring smaller Cut values with possibly unbalanced
partitions over balanced partitions with larger Cut values. This effect is
magnified because smaller $\lambda$ leads to sparser connections in
low-density areas. Since the balancing term is not impacted, varying $\lambda$
from 1 to 0 moves the partition from the relatively balanced position toward
the density valley (see also Thm. 2 in Sec. 4). Practically, $\lambda$
provides the flexibility to optimize the tradeoff between the simple Cut and
the cluster size. The cluster-size threshold $\delta$ in the optimization step
(Eq. (9)) is used to ensure that clusters are not too small, thus avoiding
outliers. We can also iterate over $\delta$ to find possibly different valley
cuts of different sizes. This procedure can sometimes be used for
size-constrained clustering [15]. We will demonstrate some of these ideas in
Sec. 5.3.

Fig. 3. Density level sets and rank estimates for unbalanced and proximal
Gaussian mixtures. High/low ranks correspond to high/low density levels.

4 Analysis 

The proofs of the theorems here appear in the Appendix. Assume the data set
$\{x_1, \ldots, x_n\}$ is drawn i.i.d. from a density $f$ in $\mathbb{R}^d$,
where $f$ has compact support $C$. Let $G = (V, E)$ be the RMD graph. Given a
separating hyperplane $S$, denote by $C^+, C^-$ the two subsets of $C$ split
by $S$, and by $\eta_d$ the volume of the unit ball in $\mathbb{R}^d$.

First we show the asymptotic consistency of the rank $R(y)$ of some point
$y$. The limit of $R(y)$, $p(y)$, is the complement of the volume of the level
set containing $y$. Note that $p$ exactly follows the shape of $f$, and always
ranges in $[0, 1]$ no matter how $f$ scales.

Theorem 1. Assume the density $f$ satisfies some regularity conditions. For a
proper choice of parameters of $G$, as $n \to \infty$, we have

$$R(y) \to p(y) := \int_{\{x :\, f(x) \le f(y)\}} f(x)\,dx. \qquad (12)$$

Next we study the RatioCut induced on the unweighted RMD graph (the case of
NCut is similar). The limit cut expression on the RMD graph involves an
additional adjustable term which varies according to the density. This implies
that Cut values in high-density areas can be significantly more expensive than
in low-density areas. Notice that this effect becomes stronger when $\lambda$
varies from 1 to 0, which means the minimum will be attained in areas of even
lower density. For technical simplicity, we assume the RMD graph ideally
connects each point $x$ to its $\deg(x)$ closest neighbors.

Theorem 2. Assume the smoothness assumptions in [2] hold for the density $f$,
and $S$ is a fixed hyperplane in $\mathbb{R}^d$. For the unweighted RMD graph,
set the degrees of points according to Eq. (8), where $\lambda \in (0, 1)$ is
a constant. Let $\rho(x) = \lambda + 2(1 - \lambda)p(x)$. Assume
$k_n/n \to 0$. In case $d = 1$, assume $k_n/\sqrt{n} \to \infty$; in case
$d \ge 2$, assume $k_n/\log n \to \infty$. Then as $n \to \infty$ we have that:

$$\frac{1}{k_n} \sqrt[d]{\frac{n}{k_n}}\, \mathrm{RatioCut}_n(S) \to C_d \int_S f^{1 - \frac{1}{d}}(s)\, \rho(s)^{1 + \frac{1}{d}}\, ds \left( \mu(C^+)^{-1} + \mu(C^-)^{-1} \right), \qquad (13)$$

where $C_d = \frac{2\eta_{d-1}}{(d+1)\,\eta_d^{1 + 1/d}}$ and
$\mu(C^{\pm}) = \int_{C^{\pm}} f(x)\,dx$.

Compared to the limit expression on the $k$-NN graph ([2]), there is an
additional term $\rho(x) = \lambda + 2(1 - \lambda)p(x)$ here. To see the
impact, suppose $\lambda$ is small; we see that for $S$ near modes,
$p(x) \approx 1$ and this extra term is nearly $2^{1 + 1/d}$. For $S$ passing
through valleys this term is nearly $\lambda^{1 + 1/d} < 1$. So graph-cut
values near modes are penalized more than those near valleys.
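
To make the size of this extra factor concrete, the sketch below evaluates $(\lambda + 2(1 - \lambda)p)^{1 + 1/d}$ for a limiting rank $p$. The function name is hypothetical, and the exponent $1 + 1/d$ is taken from the limit expression as read here.

```python
def density_penalty(p, lam, d):
    """Extra factor rho^(1 + 1/d), with rho = lam + 2 * (1 - lam) * p.

    p is the limiting rank in [0, 1], lam the RMD parameter, d the dimension.
    """
    rho = lam + 2.0 * (1.0 - lam) * p
    return rho ** (1.0 + 1.0 / d)
```

For $d = 2$ and $\lambda = 0.2$, a cut through a mode ($p = 1$) is weighted by about $1.8^{1.5} \approx 2.41$, while a cut through a deep valley ($p = 0$) is weighted by about $0.2^{1.5} \approx 0.09$. With $\lambda = 1$ the factor is 1 everywhere, which corresponds to the plain $k$-NN graph.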

Fig. 4. Graphs and clustering results of SC on the 2 moons and 1 Gaussian data
set. Panels: (a) $k$-NN; (b) $b$-matching; (c) $\epsilon$-graph (full-RBF);
(d) RMD. SC on full-RBF ($\epsilon$-graph) completely fails due to the
outlier. For $k$-NN and $b$-matching graphs SC cannot recognize the long
winding low-density regions between the 2 moons, and fails to find the
rightmost small cluster. Our method significantly sparsifies the graph in
low-density regions, enabling SC to cut along the curved valley and detect the
small cluster, and it is robust to outliers as well.



5 Simulations 

Many of the examples in this section focus on the unbalanced datasets. Unbal- 
anced data is obtained by sampling the data set in an unbalanced way. Some 
general simulation parameters are: 

(1) In the U-statistic rank calculation (Sec. 3.1), we fix the number of resamplings $B = 5$.

(2) All error rate results are averaged over 20 trials. 
Other parameters will be specified below. 



5.1 Multi-Cluster Complex-Shaped Clusters

Consider a data set composed of 1 small Gaussian and 2 moon-shaped proximal
clusters, shown in Fig. 4. The sample size is $n = 1000$, with the rightmost
small cluster containing 10% of the points and the two moons 45% each. In this
example, for the purpose of illustration, we did not optimize $\lambda$ or any
of the other parameters. We fix $\lambda = 0.5$, and choose $k = l = 30$,
$\epsilon = \sigma = d_k$, where $d_k$ is the average $k$-NN distance. On
$k$-NN and $b$-matching graphs SC fails for two reasons: (1) SC cuts at
balanced positions and cannot detect the rightmost small cluster; (2) SC
cannot recognize the long winding low-density regions between the 2 moons
because there are too many spurious edges and the Cut value along the curve is
large. SC fails on the $\epsilon$-graph (and similarly on full-RBF) because
the outlier point forms a singleton cluster, and it also cannot recognize the
low-density curve. The RMD graph significantly sparsifies the graph in
low-density regions, enabling SC to cut along the winding valley and detect
the small cluster while remaining robust to outliers. Naturally, these results
depend on the choices of $k$, $\sigma$, and $\epsilon$. However, our choices
represent the best-case scenarios for these methods and we did not see any
significant improvements by varying these parameters.

5.2 Real DataSets 

We focus on unbalanced settings and consider several real data sets. We
construct $k$-NN, $b$-matching, full-RBF and RMD graphs, all combined with RBF
weights, but do not include the $\epsilon$-graph because of its overall poor
performance. For fairness of comparison, we vary not only $\lambda$ of RMD but
also $k$, $\sigma$ under the optimization step in Sec. 3. For example, the
result of the RBF $k$-NN graph is chosen based on optimizing the following
expression:

$$J(\delta) = \min_{k, \sigma} \mathrm{Cut}(C(k, \sigma), \bar{C}(k, \sigma)) \quad \text{s.t.} \quad \min\{|C(k, \sigma)|, |\bar{C}(k, \sigma)|\} \ge \delta n, \qquad (14)$$

where $C(k, \sigma), \bar{C}(k, \sigma)$ denotes the RatioCut partition
obtained on the RBF $k$-NN graph with nearest neighbor parameter $k$ and RBF
parameter $\sigma$. The optimization problem is non-convex but involves a
search over a small number of parameters. We discretized the parameters in our
experiments. We varied $k$ in $\{20, 30, 100\}$. For the RBF parameter
$\sigma$ it has been suggested that it should be of the same scale as the
average $k$-NN distance $d_k$ [5]. This suggested a discretization of $\sigma$
as $2^j d_k$ with $j = -4, -3, \ldots, 4$. We discretized
$\lambda \in [0, 1]$ in steps of 0.2. Notice that for $\lambda = 1$, the RMD
graph is identical to the $k$-NN graph. $l$ is set identical to $k$. We assume
meaningful clusters are at least 5% of the total number of points
($\delta = 0.05$). We set the GTAM parameter $\mu = 0.05$ for the SSL
applications. For each SSL run 20 randomly labeled samples are chosen with at
least one sample from each class.
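
The discretization of $\sigma$ and $\lambda$ described above can be written out explicitly. The helper names are hypothetical, and `d_k` is assumed to be the average $k$-NN distance computed separately for the chosen $k$.

```python
def sigma_grid(d_k):
    """RBF widths 2^j * d_k for j = -4, -3, ..., 4 (nine values)."""
    return [2.0 ** j * d_k for j in range(-4, 5)]

def lambda_grid():
    """lambda in [0, 1] in steps of 0.2 (six values)."""
    return [round(0.2 * i, 1) for i in range(6)]
```

Each grid point would then be passed through the constrained selection of Eq. (14) (or Eq. (9) for $\lambda$), keeping the admissible setting with the smallest Cut value.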

Varying Unbalancedness: We start with a comparison for 8vs9 of the
256-dim USPS digit data set. We keep the total sample size at 750, and vary
the unbalancedness, i.e. the proportion of the numbers of points from the two
clusters, denoted by $n_8$, $n_9$. Fig. 5 shows that as the unbalancedness
increases, the performance severely degrades on traditional graphs, while our
method can adapt the graph-based learning algorithms to different levels of
unbalancedness very well.

Other Real Data Sets: We apply SC and SSL algorithms on several other 
real data sets including USPS, waveform database generator(21-dim), Statlog 
landsat satellite images (36-dim), letter recognition images(16-dim) and optical 

Fig. 5. Error rate performance of SC and GTAM on 8vs9 of the USPS digit data
set with varying levels of unbalancedness (horizontal axis: unbalancedness
$n_8/n_9$). Panels: (a) SC on USPS 8vs9; (b) GTAM on USPS 8vs9. We omitted GRF
since the results are qualitatively similar. Notice that not only $\lambda$
but also $k$, $\sigma$ have been optimized. Our method adapts to different
levels of unbalancedness much better than traditional graphs.



Error Rates (%)   8vs9   1,8,3,9  4vs3   3,4,5  1,4,7  9vs8   6vs8   1,4,8,9  6vs7   6,7,8
RBF k-NN          16.67  13.21    12.80  18.94  25.33   9.67  10.76  26.76    4.89   37.72
RBF b-matching    17.33  12.75    12.73  18.86  25.67  10.11  11.44  28.53    5.13   38.33
full-RBF          19.87  16.56    18.59  21.33  34.69  11.61  15.47  36.22    7.45   35.98
RBF RMD            4.80   9.18     7.87  15.26  19.72   5.43   6.67  21.35    2.92   28.68

Table 1. Error rate performance of Spectral Clustering on various graphs for
unbalanced real data sets. Column headers give the class indices; the column
groups correspond, from left to right, to the USPS, SatImg, OptDigit and
LetterRec data sets. Notice that not only $\lambda$ but also $k$, $\sigma$ are
optimized. Our method performs significantly better than other methods.



recognition of handwritten digits (64-dim). We fix 150/600, 200/400/600,
200/300/400/500 samples for the 2-, 3- and 4-class cases, with the
corresponding orders of class indices listed in Tables 1 and 2. Tables 1 and 2
show that even when $k$ and $\sigma$ for the RBF $k$-NN ($b$-matching) and
full-RBF graphs are optimized to achieve optimal performance, the RMD graph
still consistently outperforms the other methods.

5.3 Applications to Small Cluster Detection 

We illustrate how our method can be used to find small-size clusters. This
type of problem arises in community detection in large real networks, where
graph-based approaches are popular but small-size community detection is
difficult [17].

Our synthetic data set, depicted in Fig. 6, has 1 large and 2 small proximal
Gaussian components along the $x_1$ axis:
$\sum_{i=1}^{3} \alpha_i N(\mu_i, \Sigma_i)$, where
$\alpha_1 : \alpha_2 : \alpha_3 = 2 : 8 : 1$, $\mu_1 = [-0.7; 0]$,
$\mu_2 = [4.5; 0]$, $\mu_3 = [9.7; 0]$, $\Sigma_1 = I$,
$\Sigma_2 = \mathrm{diag}(2, 1)$, $\Sigma_3 = 0.7I$.

Fig. 7(a) shows a plot of cut values for different cut positions averaged over
20 Monte Carlo runs. We note that the cut-value plot resembles the underlying
density. The two density valleys are both at unbalanced positions. The
rightmost cluster is smaller than the left cluster, but has a deeper valley.

Error Rates (%)          8vs6   1,8,3,9  4vs3   1,4,7  6vs8   8vs9   6,1,8  6vs7   6,7,8
GRF   RBF k-NN            5.70  13.29    14.64  16.68   5.68   7.57   7.53   7.67  28.33
GRF   RBF b-matching      6.02  13.06    13.89  16.22   5.95   7.85   7.92   7.82  29.21
GRF   full-RBF           15.41  12.37    14.22  17.58   5.62   9.28   7.74  11.52  28.91
GRF   RBF RMD             1.08  10.24     9.74  15.04   2.07   2.30   5.82   5.23  27.24
GTAM  RBF k-NN            4.11  10.88    26.63  20.68  11.76   5.74  12.68  19.45  27.66
GTAM  RBF b-matching      3.96  10.83    27.03  20.83  12.48   5.65  12.28  18.85  28.01
GTAM  full-RBF           16.98  11.28    18.82  21.16  13.59   7.73  13.09  18.66  30.28
GTAM  RBF RMD             1.22   9.13    18.68  19.24   5.81   3.12  10.73  15.67  25.19

Table 2. Error rate performance of GRF and GTAM on various graphs for
unbalanced real data sets. Column headers give the class indices; the column
groups correspond, from left to right, to the USPS, SatImg, OptDigit and
LetterRec data sets. Notice that not only $\lambda$ but also $k$, $\sigma$ are
optimized to achieve best performance. Our method performs significantly
better than other methods.


To apply our method we vary the cluster-size threshold $\delta$ in Eq. (9). We
can now plot the Cut value against $\delta$, as shown in Fig. 7(b). As seen in
Fig. 7(b), when $\delta > 0.3$, the optimal cut is close to the valley.
However, since the proportion of data samples in the smaller clusters is less
than 30%, we see that the optimal cut is bounded away from both valleys. As
$\delta$ is further decreased, namely in the range $0.25 > \delta > 0.15$, the
optimal cut is now attained at the left valley ($x_1 \approx 1.8$). An
interesting phenomenon is that the curve flattens out in this range. This
corresponds to the fact that the cut value is minimized at this position
($x_1 = 1.8$) for any value of $\delta \in [0.15, 0.25]$. This flattening out
can happen only at valleys, since valleys represent a "local" minimum for the
optimization step of Eq. (9) under the constraint imposed by $\delta$.
Consequently, small clusters can be detected based on the flat spots. Next,
when we further vary $\delta$ in the region $0.1 > \delta > 0.05$, the best
cut is attained near the right and deeper valley ($x_1 \approx 8.2$). Again
the curve flattens out, revealing another small cluster.

5.4 Comments on RMD Method 

Tuning Parameters: We first describe parameters involved in our RMD method.
We have already pointed out that $\lambda$ is a parameter that is optimized
and so does not count as a tuning parameter. So we are left with parameters
$l$ and $\delta$. As we

Fig. 6. Gaussian mixture with three unbalanced Gaussian components. Results of
our method are depicted for a single realization. Our method is able to
discover two small clusters. The larger cluster is detected for a larger value
of $\delta$ and the smaller cluster is detected for a smaller value of
$\delta$ (see Eq. (9)).

Fig. 7. 2-clustering results for the mixture of 1 large and 2 small proximal
Gaussian components. Panels: (a) Cut value vs. cut position ($x_1$);
(b) Cut value vs. cluster-size threshold ($\delta$). Both valleys are at
unbalanced positions. The rightmost cluster is smaller than the left cluster,
with a deeper valley. $n = 1100$; binary weights are adopted. Cut values in
(a) are averaged over 20 Monte Carlo runs. Results in (b) are from one run. By
varying the cluster-size threshold $\delta$, our method is able to detect
different small clusters and generate meaningful cuts.

pointed out in Sec. 3 the choice of S is based on our prior or desire to find clusters 
that are sizable, say 5% to 10% of the data. This leaves the choice to a single 
tuning parameter, namely, I. Our method appears to be relatively insensitive to 
choice of Note that unlike k and a, which are used for graph construction, 
the parameter I here is primarily used to relatively order data points based on 
whether they belong to high-density or low-density regions. In most situations 
we have encountered this ranking does not substantially change, namely, it is 
rarely the case where an empirically low ranked data point should have a high- 
rank (i.e. high-density region). Similar results have also been observed in the 
context of high-dimensional anomaly detection |18I19] . 

Time Complexity: The time complexity of the U-statistic rank computation is $O(Bdn^2\log n)$, and that of RMD graph construction is $O(dn^2\log n)$, which leads to an aggregate complexity of $O((B+1)dn^2\log n)$. In experiments we set $B = 5$, so the complexity is of the same order as constructing a $k$-NN graph ($O(dn^2\log n)$).

6 Conclusions 

We have demonstrated that spectral clustering and graph-based semi-supervised learning algorithms can fail on conventional graph methods for unbalanced and proximal data clusters. We propose a systematic procedure for graph construction (RMD graph), based on adaptive sparsification and densification of neighborhoods of $k$-NN graphs. Our method effectively incorporates density, maintains robustness to outliers, and adapts to different degrees of unbalancedness. We present an optimization framework for graph-based approaches, which allows for the best sizable clusters separated by the smallest cut value. By constraining the smallest cluster sizes we can detect multiple small clusters and generate different meaningful cuts. Our simulations demonstrate significant performance improvements over existing methods for synthetic and real datasets. The ability to detect small-size clusters (Fig. 7) indicates that our idea may be utilized in other applications such as community detection in large real networks, where graph-based approaches are popular but small-size community detection is difficult [17].
Appendix: Proofs of Theorems

For ease of development, let $n = m_1(m_2 + 1)$, and divide the $n$ data points into $D = D_0 \cup D_1 \cup \cdots \cup D_{m_1}$, where $D_0 = \{x_1, \ldots, x_{m_1}\}$, and each $D_j$, $j = 1, \ldots, m_1$, involves $m_2$ points. $D_j$ is used to generate the statistic $G$ for $u$ and $x_j \in D_0$, for $j = 1, \ldots, m_1$. $D_0$ is used to compute the rank of $u$:

\[
R(u) = \frac{1}{m_1} \sum_{j=1}^{m_1} \mathbb{I}_{\{G(x_j; D_j) > G(u; D_j)\}} \tag{15}
\]

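We provide the proof for the statistic $G(u)$ of the following form:

\[
G(u; D_j) = \frac{1}{l} \sum_{i = l - \lfloor (l-1)/2 \rfloor}^{l + \lfloor l/2 \rfloor} \left(\frac{l}{i}\right)^{1/d} D_{(i)}(u), \tag{16}
\]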

where $D_{(i)}(u)$ denotes the distance from $u$ to its $i$-th nearest neighbor among the $m_2$ points in $D_j$. Practically we can omit the weight, as in Eq.© in the paper. The proof for the first and second statistics can be found in [18].
Proof of Theorem 1: 

Proof. The proof involves two steps: 

1. The expectation of the empirical rank $\mathbb{E}[R(u)]$ is shown to converge to $p(u)$ as $n \to \infty$.

2. The empirical rank $R(u)$ is shown to concentrate at its expectation as $n \to \infty$.

The first step is shown through Lemma 2. For the second step, notice that the rank $R(u) = \frac{1}{m_1}\sum_{j=1}^{m_1} Y_j$, where the $Y_j = \mathbb{I}_{\{G(x_j; D_j) > G(u; D_j)\}}$ are independent across different $j$'s, and $Y_j \in [0, 1]$. By Hoeffding's inequality, we have:

\[
\mathbb{P}\left(|R(u) - \mathbb{E}[R(u)]| > \epsilon\right) \le 2\exp\left(-2 m_1 \epsilon^2\right) \tag{17}
\]

Combining these two steps finishes the proof. 
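The partition scheme behind the empirical rank and the Hoeffding bound can be illustrated numerically. The sketch below is not the authors' code: it assumes NumPy, uses the plain $l$-th nearest-neighbor distance as the statistic $G$ (a simplification of the averaged statistic treated in this appendix), and the function names `lnn_distance`, `empirical_rank`, and `hoeffding_bound` are hypothetical.

```python
import numpy as np


def lnn_distance(u, D, l):
    """G(u; D): distance from u to its l-th nearest neighbor among the rows of D."""
    dists = np.linalg.norm(D - u, axis=1)
    return np.partition(dists, l - 1)[l - 1]


def empirical_rank(u, data, m1, l, rng):
    """R(u) = (1/m1) * sum_j 1{G(x_j; D_j) > G(u; D_j)}, using n = m1 * (m2 + 1) points.

    D_0 holds m1 reference points x_j; each x_j is paired with its own
    disjoint block D_j of m2 points, so the m1 indicators are independent.
    """
    n = data.shape[0]
    m2 = n // m1 - 1
    if m2 < l:
        raise ValueError("each block D_j needs at least l points")
    perm = rng.permutation(n)
    D0 = data[perm[:m1]]
    blocks = data[perm[m1:m1 + m1 * m2]].reshape(m1, m2, -1)
    hits = 0
    for x_j, D_j in zip(D0, blocks):
        hits += int(lnn_distance(x_j, D_j, l) > lnn_distance(u, D_j, l))
    return hits / m1


def hoeffding_bound(m1, eps):
    """Right-hand side of the concentration bound: 2 * exp(-2 * m1 * eps^2)."""
    return min(1.0, 2.0 * float(np.exp(-2.0 * m1 * eps ** 2)))


# Usage on a 2-D standard Gaussian sample: the mode should rank high, a far outlier low.
rng = np.random.default_rng(0)
data = rng.standard_normal((5050, 2))  # n = m1 * (m2 + 1) with m1 = 50, m2 = 100
r_center = empirical_rank(np.zeros(2), data, m1=50, l=5, rng=rng)
r_tail = empirical_rank(np.array([4.0, 4.0]), data, m1=50, l=5, rng=rng)
print(r_center, r_tail, hoeffding_bound(50, 0.2))
```

The bound depends only on $m_1$, which is why the split trades off the number of reference points against the block size $m_2$ that controls the bias term handled in Lemma 2.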

Proof of Theorem 2: 

Proof. We only present a brief outline of the proof. We want to establish the convergence result of the cut term and the balancing terms respectively, that is:

\[
\frac{1}{n k_n}\left(\frac{n}{k_n}\right)^{1/d} \mathrm{cut}_n(S) \longrightarrow C_d \int_S f^{1-\frac{1}{d}}(s)\,\rho(s)^{1+\frac{1}{d}}\,ds, \tag{18}
\]

where $V^{+}$ ($V^{-}$) $= \{x \in V : x \in C^{+}\ (C^{-})\}$ are the discrete versions of $C^{+}$ ($C^{-}$). Eq. (18) is established in two steps. First we can show that the LHS cut term converges to its expectation $\mathbb{E}\left[\frac{1}{n k_n}\left(\frac{n}{k_n}\right)^{1/d} \mathrm{cut}_n(S)\right]$ by making use of McDiarmid's inequality. Second we show that this expectation term actually converges to the RHS of Eq. (18). This is the most intricate part and we state it as a separate result in Lemma 1.

For Eq. (19), recall that the volume term of $V^{\pm}$ is $\mathrm{vol}(V^{\pm}) = \sum_{u \in V^{\pm},\, v \in V} w(u, v)$. It can be shown that as $n \to \infty$, the distance between any connected pair $(u, v)$ goes to zero. Next we note that the number of points in $V^{\pm}$ is binomially distributed, $\mathrm{Binom}(n, \mu(C^{\pm}))$. Using the Chernoff bound for binomial sums we can show that Eq. (19) holds true almost surely.
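Lemma 1. Given the assumptions of Theorem 2,

\[
\mathbb{E}\left[\frac{1}{n k_n}\left(\frac{n}{k_n}\right)^{1/d} \mathrm{cut}_n(S)\right] \longrightarrow C_d \int_S f^{1-\frac{1}{d}}(s)\,\rho(s)^{1+\frac{1}{d}}\,ds, \tag{20}
\]

where $C_d = \dfrac{2 v_{d-1}}{(d+1)\, v_d^{1+1/d}}$.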



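Proof. The proof is similar to [3] and we provide an outline here. The first trick is to define a cut function for a fixed point $x_i \in V^{+}$, whose expectation is easier to compute:

\[
\mathrm{cut}_{x_i} = \sum_{v \in V^{-}} w(x_i, v). \tag{21}
\]

Similarly, we can define $\mathrm{cut}_{x_i}$ for $x_i \in V^{-}$. The expectations of $\mathrm{cut}_{x_i}$ and $\mathrm{cut}_n(S)$ can be related:

\[
\mathbb{E}(\mathrm{cut}_n(S)) = n\,\mathbb{E}_x(\mathbb{E}(\mathrm{cut}_x)). \tag{22}
\]

Then the value of $\mathbb{E}(\mathrm{cut}_{x_i})$ can be computed as

\[
\mathbb{E}(\mathrm{cut}_{x_i}) = (n-1) \int_0^{\infty} \left[ \int_{B(x_i, r) \cap C^{-}} f(y)\,dy \right] dF_{r^k_{x_i}}(r), \tag{23}
\]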



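where $r$ is the distance of $x_i$ to its $k_n\rho(x_i)$-th nearest neighbor. The value of $r$ is a random variable and can be characterized by the CDF $F_{r^k_{x_i}}(r)$. Combining this with Eq. (22), we can write down the whole expected cut value:

\begin{align}
\mathbb{E}(\mathrm{cut}_n(S)) &= n\,\mathbb{E}_x(\mathbb{E}(\mathrm{cut}_x)) = n \int f(x)\,\mathbb{E}(\mathrm{cut}_x)\,dx \tag{24} \\
&= n(n-1) \int f(x) \left[ \int_0^{\infty} g(x, r)\, dF_{r^k_x}(r) \right] dx. \tag{25}
\end{align}

To simplify the expression, we use $g(x, r)$ to denote

\[
g(x, r) =
\begin{cases}
\int_{B(x,r) \cap C^{-}} f(y)\,dy, & x \in C^{+}, \\
\int_{B(x,r) \cap C^{+}} f(y)\,dy, & x \in C^{-}.
\end{cases} \tag{26}
\]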



Under general assumptions, when $n$ tends to infinity, the random variable $r$ will highly concentrate around its mean $\mathbb{E}(r^k_x)$. Furthermore, as $k_n/n \to 0$, $\mathbb{E}(r^k_x)$ tends to zero, with the speed of convergence

\[
\mathbb{E}(r^k_x) \approx \left( \frac{k\rho(x)}{(n-1) f(x) v_d} \right)^{1/d}. \tag{27}
\]

So the inner integral in the cut value can be approximated by $g(x, \mathbb{E}(r^k_x))$, which implies

\[
\mathbb{E}(\mathrm{cut}_n(S)) \approx n(n-1) \int f(x)\, g(x, \mathbb{E}(r^k_x))\,dx. \tag{28}
\]
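The next trick is to decompose the integral over $\mathbb{R}^d$ into two orthogonal directions, i.e., the direction along the hyperplane $S$ and its normal direction (we use $\vec{n}$ to denote the unit normal vector):

\[
\int_{\mathbb{R}^d} f(x)\, g(x, \mathbb{E}(r^k_x))\,dx = \int_S \int_{-\infty}^{+\infty} f(s + t\vec{n})\, g(s + t\vec{n}, \mathbb{E}(r^k_{s+t\vec{n}}))\,dt\,ds. \tag{29}
\]

When $t > \mathbb{E}(r^k_{s+t\vec{n}})$, the integral region of $g$ will be empty: $B(x, \mathbb{E}(r^k_x)) \cap C^{-} = \emptyset$. On the other hand, when $x = s + t\vec{n}$ is close to $s \in S$, we have the approximation $f(x) \approx f(s)$:

\begin{align}
& \int_{-\infty}^{+\infty} f(s + t\vec{n})\, g(s + t\vec{n}, \mathbb{E}(r^k_{s+t\vec{n}}))\,dt \tag{30} \\
&\quad \approx 2 \int_0^{\mathbb{E}(r^k_s)} f(s) \left[ f(s)\, \mathrm{vol}\left( B(s + t\vec{n}, \mathbb{E}(r^k_s)) \cap C^{-} \right) \right] dt \tag{31} \\
&\quad = 2 f^2(s) \int_0^{\mathbb{E}(r^k_s)} \mathrm{vol}\left( B(s + t\vec{n}, \mathbb{E}(r^k_s)) \cap C^{-} \right) dt. \tag{32}
\end{align}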



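The term $\mathrm{vol}\left( B(s + t\vec{n}, \mathbb{E}(r^k_s)) \cap C^{-} \right)$ is the volume of a $d$-dimensional spherical cap of radius $\mathbb{E}(r^k_s)$, whose base is at distance $t$ from the center. Through direct computation we obtain:

\[
\int_0^{\mathbb{E}(r^k_s)} \mathrm{vol}\left( B(s + t\vec{n}, \mathbb{E}(r^k_s)) \cap C^{-} \right) dt = \mathbb{E}(r^k_s)^{d+1}\, \frac{v_{d-1}}{d+1}. \tag{33}
\]

Combining the above steps and plugging in the approximation of $\mathbb{E}(r^k_s)$ in Eq. (27), we finish the proof.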
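The spherical-cap computation used in this proof can be checked numerically. The sketch below is an illustration, not part of the original proof: it assumes that $v_d$ denotes the volume of the unit ball in $\mathbb{R}^d$ and verifies, by midpoint quadrature, that integrating the cap volume over the distance $t \in [0, r]$ gives $v_{d-1} r^{d+1}/(d+1)$. All function names are hypothetical.

```python
import math


def unit_ball_volume(d):
    """Volume of the unit ball in R^d (equals 1 for d = 0)."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)


def cap_volume(r, t, d, steps=1000):
    """Volume of the part of a radius-r ball in R^d lying beyond a hyperplane
    at distance t from the center, via the midpoint rule over (d-1)-dim slices."""
    if t >= r:
        return 0.0
    h = (r - t) / steps
    total = 0.0
    for i in range(steps):
        s = t + (i + 0.5) * h
        total += (r * r - s * s) ** ((d - 1) / 2)  # slice radius is sqrt(r^2 - s^2)
    return unit_ball_volume(d - 1) * total * h


def integrated_cap_volume(r, d, steps=200):
    """Midpoint-rule approximation of the integral of cap_volume(r, t, d) over t in [0, r]."""
    h = r / steps
    return sum(cap_volume(r, (i + 0.5) * h, d) for i in range(steps)) * h


def closed_form(r, d):
    """Claimed value of the integral: v_{d-1} * r^(d+1) / (d+1)."""
    return unit_ball_volume(d - 1) * r ** (d + 1) / (d + 1)


checks = {d: (integrated_cap_volume(0.7, d), closed_form(0.7, d)) for d in (1, 2, 3)}
for d, (numeric, exact) in checks.items():
    print(d, numeric, exact)
```

The extra factor of $r$ in the exponent $d+1$ comes from the outer integration over $t$, which is what produces the $\rho(s)^{1+1/d}$ term in the limit.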

Lemma 2. By choosing $l$ properly, as $m_2 \to \infty$, it follows that

\[
\left| \mathbb{E}[R(u)] - p(u) \right| \longrightarrow 0.
\]
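Proof. Take expectation with respect to $D$:

\begin{align}
\mathbb{E}_D[R(u)] &= \mathbb{E}_{D \setminus D_0} \left[ \mathbb{E}_{D_0} \left[ \frac{1}{m_1} \sum_{j=1}^{m_1} \mathbb{I}_{\{G(u; D_j) < G(x_j; D_j)\}} \right] \right] \tag{34} \\
&= \frac{1}{m_1} \sum_{j=1}^{m_1} \mathbb{E}_{x_j} \left[ \mathbb{E}_{D_j} \left[ \mathbb{I}_{\{G(u; D_j) < G(x_j; D_j)\}} \right] \right] \tag{35} \\
&= \mathbb{E}_x \left[ \mathbb{P}_{D_1} \left( G(u; D_1) < G(x; D_1) \right) \right]. \tag{36}
\end{align}

The last equality holds due to the i.i.d. symmetry of $\{x_1, \ldots, x_{m_1}\}$ and $D_1, \ldots, D_{m_1}$. We fix both $u$ and $x$ and temporarily discard $\mathbb{E}_{D_1}$. Let $F_x(y_1, \ldots, y_{m_2}) = G(x) - G(u)$, where $y_1, \ldots, y_{m_2}$ are the $m_2$ points in $D_1$. It follows:

\[
\mathbb{P}_{D_1}(G(u) < G(x)) = \mathbb{P}_{D_1}(F_x(y_1, \ldots, y_{m_2}) > 0) = \mathbb{P}_{D_1}(F_x - \mathbb{E}F_x > -\mathbb{E}F_x). \tag{37}
\]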
To check McDiarmid's requirements, we replace $y_j$ with $y'_j$. It is easily verified that $\forall j = 1, \ldots, m_2$,

\[
\left| F_x(y_1, \ldots, y_{m_2}) - F_x(y_1, \ldots, y'_j, \ldots, y_{m_2}) \right| \le 2^{1/d}\, \frac{2C}{l} \le \frac{4C}{l}, \tag{38}
\]

where $C$ is the diameter of the support. Notice that despite the fact that $y_1, \ldots, y_{m_2}$ are random vectors we can still apply McDiarmid's inequality, because according to the form of $G$, $F_x(y_1, \ldots, y_{m_2})$ is a function of $m_2$ i.i.d. random variables $r_1, \ldots, r_{m_2}$, where $r_i$ is the distance from $x$ to $y_i$. Therefore if $\mathbb{E}F_x < 0$, or $\mathbb{E}G(x) < \mathbb{E}G(u)$, we have by McDiarmid's inequality,

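\[
\mathbb{P}_{D_1}(G(u) < G(x)) = \mathbb{P}_{D_1}(F_x > 0) = \mathbb{P}_{D_1}(F_x - \mathbb{E}F_x > -\mathbb{E}F_x) \le \exp\left( -\frac{(\mathbb{E}F_x)^2\, l^2}{8 C^2 m_2} \right). \tag{39}
\]

Rewrite the above inequality as:

\[
\mathbb{I}_{\{\mathbb{E}F_x > 0\}} - e^{-\frac{(\mathbb{E}F_x)^2 l^2}{8 C^2 m_2}} \le \mathbb{P}_{D_1}(F_x > 0) \le \mathbb{I}_{\{\mathbb{E}F_x > 0\}} + e^{-\frac{(\mathbb{E}F_x)^2 l^2}{8 C^2 m_2}}. \tag{40}
\]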

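It can be shown that the same inequality holds for $\mathbb{E}F_x > 0$, or $\mathbb{E}G(x) > \mathbb{E}G(u)$. Now we take expectation with respect to $x$:

\[
\mathbb{P}_x(\mathbb{E}F_x > 0) - \mathbb{E}_x\left[ e^{-\frac{(\mathbb{E}F_x)^2 l^2}{8 C^2 m_2}} \right] \le \mathbb{E}_x\left[ \mathbb{P}_{D_1}(F_x > 0) \right] \le \mathbb{P}_x(\mathbb{E}F_x > 0) + \mathbb{E}_x\left[ e^{-\frac{(\mathbb{E}F_x)^2 l^2}{8 C^2 m_2}} \right]. \tag{41}
\]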

Divide the support of $x$ into two parts, $X_1$ and $X_2$, where $X_1$ contains those $x$ whose density $f(x)$ is relatively far away from $f(u)$, and $X_2$ contains those $x$ whose density is close to $f(u)$. We show that for $x \in X_1$, the above exponential term converges to 0 and $\mathbb{P}(\mathbb{E}F_x > 0) = \mathbb{P}_x(f(u) > f(x))$, while the rest, $x \in X_2$, has very small measure. Let $A(x) = \left( \frac{l}{f(x)\, c_d\, m_2} \right)^{1/d}$. By Lemma 3 we have:
\EG{x)-A{x)\<-f[ — A(.t)<7' > ' > / Ti W 



m-2/ ' V"^2/ \frmnCdrrL2J KcJ''^ ) \'™2 

(42) 

where 7 denotes the big O(-), and 71 = 7 ( y ^. J . Applying uniform bound 
we have: 

A{x)-A{u)-2 [^^^ (i) ' ^ ^ [G(-) - G{u)] < A{x)-A{u)+2 [^^^ [±- 

(43) 

Now let Xi = {a; : \ f{x) - f{u)\ > 3jidf^^ (;^) For x G Xi, it can be veri- 
fied that |A(a:)-A(^)| > 3 (^4^) (_L)%r|E[G(a;)-G(^)]|> (^^) 
and /(x)} = I{EG(x)>EG(u)}- For the exponential term in Egu. ipn]) we have: 
For a; e X2 = {x : \f{x) ~ f{u)\ < 871^ (^^^ ' /J^}, by the regularity as-

— j /,„^„ . Combining the two cases into 

Eau. (j4ip we have for upper bound: 

Ez5 [R{u)] = [Pd, {G{u) < G(x))] (45) 
{Giu) < G{x)) f{x)dx + I Vd, {G{u) < G{x)) f{x)dx (46) 



< -P, (/(«) > fi^)) + cxp — ^\ ' , r{x G Xi) + vix e m) 

V V SG^c'^ml'^^ 



d + 1 



< V. ifiu) > fix)) + exp ^\ ' , + iM^idf.J,„ [ — ) (48) 

Let $l = m_2^{\alpha}$ such that $\frac{d+4}{2d+4} < \alpha < 1$, and the latter two terms will converge to 0 as $m_2 \to \infty$. Similar lines hold for the lower bound. The proof is finished.



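Lemma 3. Let $A(x) = \left( \frac{l}{m\, c_d\, f(x)} \right)^{1/d}$, $\lambda_1 = \frac{\lambda}{f_{min}} \left( \frac{1.5}{c_d f_{min}} \right)^{1/d}$. By choosing $l$ appropriately, the expectation of the $l$-NN distance $\mathbb{E}D_{(l)}(x)$ among $m$ points satisfies:

\[
\left| \mathbb{E}D_{(l)}(x) - A(x) \right| = O\left( A(x)\, \lambda_1 \left( \frac{l}{m} \right)^{1/d} \right). \tag{49}
\]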

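Proof. Denote $r(x, \alpha) = \min\{r : \mathbb{P}(B(x, r)) \ge \alpha\}$. Let $\delta_m \to 0$ as $m \to \infty$, and $0 < \delta_m < 1/2$. Let $U \sim \mathrm{Bin}\left(m, (1+\delta_m)\frac{l}{m}\right)$ be a binomial random variable, with $\mathbb{E}U = (1+\delta_m)\, l$. We have:

\begin{align}
\mathbb{P}\left( D_{(l)}(x) > r\left(x, (1+\delta_m)\tfrac{l}{m}\right) \right) &= \mathbb{P}(U < l) \tag{50} \\
&= \mathbb{P}\left( U < \left(1 - \frac{\delta_m}{1+\delta_m}\right)(1+\delta_m)\, l \right) \tag{51} \\
&\le \exp\left( -\frac{\delta_m^2\, l}{2(1+\delta_m)} \right). \tag{52}
\end{align}

The last inequality holds from Chernoff's bound. Abbreviate $r_1 = r\left(x, (1+\delta_m)\frac{l}{m}\right)$, and $\mathbb{E}D_{(l)}(x)$ can be bounded as:

\begin{align}
\mathbb{E}D_{(l)}(x) &\le r_1 \left[ 1 - \mathbb{P}\left( D_{(l)}(x) > r_1 \right) \right] + C\, \mathbb{P}\left( D_{(l)}(x) > r_1 \right) \tag{53} \\
&\le r_1 + C \exp\left( -\frac{\delta_m^2\, l}{2(1+\delta_m)} \right), \tag{54}
\end{align}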

where C is the diameter of support. Similarly we can show the lower bound: 

ED^i){x) > r{x, (1 - 5m)^) - Cexp (^-^JM^.^ (55) 
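Consider the upper bound. We relate $r_1$ with $A(x)$. Notice that $\mathbb{P}(B(x, r_1)) = (1+\delta_m)\frac{l}{m} \ge c_d\, r_1^d f_{min}$, so a fixed but loose upper bound is $r_1 \le \left( \frac{(1+\delta_m)\, l}{c_d f_{min}\, m} \right)^{1/d} = r_{max}$. Assume $l/m$ is sufficiently small so that $r_1$ is sufficiently small. By the smoothness condition, the density within $B(x, r_1)$ is lower-bounded by $f(x) - \lambda r_1$, so we have:

\begin{align}
\mathbb{P}(B(x, r_1)) &= (1+\delta_m)\frac{l}{m} \tag{56} \\
&\ge c_d\, r_1^d \left( f(x) - \lambda r_1 \right) \tag{57} \\
&= c_d\, r_1^d f(x) \left( 1 - \frac{\lambda}{f(x)}\, r_1 \right) \tag{58} \\
&\ge c_d\, r_1^d f(x) \left( 1 - \frac{\lambda}{f_{min}}\, r_{max} \right). \tag{59}
\end{align}

That is:

\[
r_1 \le A(x) \left( \frac{1+\delta_m}{1 - \frac{\lambda}{f_{min}}\, r_{max}} \right)^{1/d}. \tag{60}
\]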



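Insert the expression of $r_{max}$ and set $\lambda_1 = \frac{\lambda}{f_{min}} \left( \frac{1.5}{c_d f_{min}} \right)^{1/d}$; we have:

\begin{align}
\mathbb{E}D_{(l)}(x) - A(x) &\le A(x) \left[ \left( \frac{1+\delta_m}{1 - \lambda_1 \left( \frac{l}{m} \right)^{1/d}} \right)^{1/d} - 1 \right] + C \exp\left( -\frac{\delta_m^2\, l}{2(1+\delta_m)} \right) \tag{63} \\
&= O\left( A(x)\, \lambda_1 \left( \frac{l}{m} \right)^{1/d} \right). \tag{64}
\end{align}

The last equality holds if we choose $l = m^{\frac{3d+8}{4d+8}}$ and $\delta_m = m^{-\frac{1}{4}}$. Similar lines follow for the lower bound. Combining these two parts, the proof is finished.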
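The leading-order claim of Lemma 3, that the expected $l$-NN distance is close to $A(x) = (l / (m\, c_d f(x)))^{1/d}$, can be sanity-checked by simulation. The sketch below is an illustration under stated assumptions, not part of the proof: it assumes $c_d$ is the volume of the unit ball in $\mathbb{R}^d$, uses a uniform density on $[0, 1]$ (so $d = 1$, $f(x) = 1$, $c_1 = 2$), and evaluates at the interior point $x = 0.5$ to avoid boundary effects. The function names are hypothetical.

```python
import math

import numpy as np


def unit_ball_volume(d):
    """c_d: volume of the unit ball in R^d (c_1 = 2 is the length of [-1, 1])."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)


def predicted_lnn_distance(l, m, f_x, d):
    """A(x) = (l / (m * c_d * f(x)))^(1/d), the leading-order l-NN distance."""
    return (l / (m * unit_ball_volume(d) * f_x)) ** (1.0 / d)


def mean_lnn_distance_uniform_1d(x, l, m, trials, rng):
    """Monte Carlo estimate of E[D_(l)(x)] for m i.i.d. Uniform[0, 1] points."""
    samples = rng.random((trials, m))
    dists = np.abs(samples - x)
    lth = np.partition(dists, l - 1, axis=1)[:, l - 1]  # l-th smallest distance per trial
    return float(lth.mean())


rng = np.random.default_rng(0)
l, m = 50, 2000
estimate = mean_lnn_distance_uniform_1d(0.5, l, m, trials=200, rng=rng)
prediction = predicted_lnn_distance(l, m, f_x=1.0, d=1)
print(estimate, prediction)
```

For a uniform density the smoothness constant $\lambda$ is zero, so the only discrepancy here is sampling noise and the finite-$m$ correction; for non-uniform densities the gap is governed by the $\lambda_1 (l/m)^{1/d}$ term in the lemma.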



